Abstract

Binomial data with unknown sizes often appear in the biological and medical sciences. Previous methods use either
the Poisson approximation or a quasi-likelihood approach. We propose a full likelihood approach that treats the
unknown sizes as latent variables. This simplifies the analysis, because maximum likelihood estimation can be
applied, and it yields more insight into the efficiency loss across models and the estimation precision within
models. A simulation study assesses the performance of the proposed model, and an application to surviving jejunal
crypt data is discussed. The proposed method is not only competitive with the previous methods but also gives an
appropriate explanation of the inflated variation of the expected sizes.
1 Introduction

Binomial data with unknown sizes often appear in the biological and medical sciences. For example, Margolin et al.
(1981) studied how the number of revertant colonies of Salmonella strain TA98 changed with the dosage of the chemical
agent quinoline. Bailer and Piegorsch (2000) reviewed a C. dubia survival and reproduction toxicity test. Trajstman
(1989) presented a data set from an experiment in which M. bovis was subjected to decontaminants. Elder (1996)
investigated how the time spent in high heat affects the survival of V79-473 cells.

Jejunal radiation damage has been studied extensively in the medical literature. Some studies concern the clinical
value of certain materials in protecting jejunal crypts against radiation (e.g., Goel et al. 2003, Salin et al.
2001 and Khan et al. 1997). Others investigate the radiosensitivity of jejunal crypt stem cells (e.g., Kinashi et al.
1997), or the survival of crypt epithelial cells in the jejunum of mice exposed to different doses of X-rays
(e.g., Mason et al. 1999). Table 1 presents a surviving jejunal crypt data set from an experiment done on 126 mice
(Kim and Taylor 1994; Elder et al. 1999). In such an experiment, each mouse is exposed to a certain dose of gamma
rays and then sacrificed to count the surviving crypts. The total number of crypts before the experiment is unknown,
since the only way to know this number is to sacrifice the mouse, while live mice are required in the experiment.

Let $y_i$ be the number of surviving crypts in mouse $i$. It is appropriate to treat $y_i$ as a binomial random variable
with size $n_i$ and surviving probability $p_i$, where $n_i$ is the total number of crypts in mouse $i$. The issue of
interest is how the surviving probability $p_i$ depends on the dose of gamma radiation $x_i$ applied to mouse $i$. If the
$n_i$ were known, one could apply classical logistic regression (e.g., McCullagh and Nelder 1999). Because
$n_i$ is unknown, $y_i$ is often treated approximately as a Poisson random variable, which is a common approach
in the literature. Such a Poisson approximation is crude when $p_i$ is moderately large (e.g., Elder et al. 1999). By
putting additional assumptions on the $n_i$, Kim and Taylor (1994) and Elder et al. (1999) developed quasi-likelihood
approaches. Kim and Taylor (1994) assumed $E(n_i) = m_i$ and $\mathrm{var}(n_i) = m_i \nu$ with $m_i$ known and $\nu \ge 1$
unknown. Elder et al. (1999) considered estimating $m = E(n_i)$ with $\mathrm{var}(n_i) = m(1 + \nu m)$ and $\nu \ge 0$.
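
To illustrate why the Poisson approximation becomes crude as $p_i$ grows, the following sketch compares Bin(n, p) with Pois(np). The binomial variance is np(1 - p) while the Poisson variance is np, so the variance ratio is 1 - p. The size n = 160 is the value that Section 6 fixes for the logistic fit; the probabilities, the function names and the use of total variation distance are illustrative assumptions, not part of the original analysis.

```python
import math

def binom_logpmf(k, n, p):
    """Log probability of k successes under Bin(n, p), 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

def pois_logpmf(k, lam):
    """Log probability of k under Pois(lam)."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def total_variation(n, p):
    """Total variation distance between Bin(n, p) and Pois(n p)."""
    lam = n * p
    upper = n + int(10 * math.sqrt(lam)) + 50  # Poisson mass beyond this is negligible
    dist = 0.0
    for k in range(upper + 1):
        b = math.exp(binom_logpmf(k, n, p)) if k <= n else 0.0
        q = math.exp(pois_logpmf(k, lam))
        dist += abs(b - q)
    return 0.5 * dist

n = 160  # size fixed in the logistic fit of Section 6
for p in (0.02, 0.1, 0.3, 0.5):  # illustrative survival probabilities
    print(f"p={p:4.2f}  variance ratio={1 - p:4.2f}  TV={total_variation(n, p):.4f}")
```

The distance grows with p, which matches the remark that the approximation is acceptable only for small survival probabilities.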

We will assume that each $n_i$ is a Poisson random variable with mean $\lambda_i$ and that the $\lambda_i$ arise as a
random sample from a mixing distribution. In particular, a gamma distribution will be used in this article. By doing
this, the requirement of prior knowledge about the $E(n_i)$ in Kim and Taylor (1994) is removed. Compared to the
quasi-likelihood approach in Elder et al. (1999), our approach simplifies the analysis, as standard techniques, i.e.,
maximum likelihood estimation, can be applied. This model also lets us investigate the efficiency loss due to the
$n_i$ being unknown and being over-dispersed, and how one parameter influences the estimation precision of the other
parameters within a model. Therefore, we can gain more insight into the problem than with previous methods.

Table 1: The jejunal crypt data (dose: the dose of gamma radiation in Gy; count: the number of surviving crypts in a
mouse).

dose   count
6.25   76, 96, 73, 81, 81, 87, 77, 75
6.50   75, 80, 67, 86, 70, 78, 88, 76, 54, 58, 76, 69, 61, 70
6.75   66, 51, 48, 48, 57, 45, 59, 49
7.25   35, 33, 35, 37, 38, 53, 37, 36, 42, 45, 48, 42, 31, 36, 40, 45, 47, 38, 40, 35, 27, 35
7.75   19, 18, 25, 19, 19, 18, 21, 18
8.00   19, 24, 19, 26, 18, 18, 14, 19, 11, 21, 19, 14, 16, 13
8.25   19, 19, 19, 16, 12, 16, 12, 13
8.75   11, 11, 7, 3, 5, 7, 9, 5, 11, 9, 6, 9, 7, 5, 10, 7, 11, 9, 7, 11, 5, 12
9.25   6, 3, 5, 6, 4, 6, 5, 3
9.50   1, 4, 5, 5, 3, 6, 3, 3, 5, 5, 1, 4, 3, 4

The proposed method is described in Section 2. Efficiency losses are studied in Section 3. The estimation precision 
is investigated in Section 4. A simulation study is presented in Section 5. The investigation of the jejunal crypt data is 
done in Section 6. 



2 The proposed method 

Suppose that the data consist of $r$ pairs $(y_i, x_i)$, $i = 1, 2, \ldots, r$, where $x_i$ is the covariate associated
with observation $i$, such that $y_i \sim \mathrm{Bin}(n_i, p_i)$ and $p_i = h(x_i, \beta)$. Here $h$ is a known link
function, such as

$$h(x, \beta) = \frac{\exp(x'\beta)}{1 + \exp(x'\beta)}.$$

When each $n_i$ is assumed to be a Poisson random variable with mean $\lambda_i$, it is easily shown that
$y_i \mid \lambda_i \sim \mathrm{Pois}(\lambda_i h(x_i, \beta))$. We will further assume that the $\lambda_i$ arise as a
random sample from a gamma density, which can be written as

$$\Gamma^{-1}(\alpha)\, \eta^{\alpha} \lambda^{\alpha - 1} \exp(-\eta\lambda), \quad \lambda \in (0, \infty),$$

where $\alpha$ is the shape parameter and $\eta$ is the rate parameter. The mean is $\mu = \alpha/\eta$ and the variance is
$\sigma^2 = \alpha/\eta^2$. Note that $\alpha = (\mu/\sigma)^2$, that is, $1/\alpha$ is the squared coefficient of variation.
With $(\alpha, \mu)$ used to parameterize gamma densities and $\theta = (\beta, \mu, \alpha)$, it is clear that marginally
$y$ is a negative binomial random variable (Anscombe 1949) with density

$$f(y; x, \theta) = \frac{\Gamma(\alpha + y)\, \alpha^{\alpha} \{\mu h(x, \beta)\}^{y}}{\Gamma(\alpha)\, y!\, \{\alpha + \mu h(x, \beta)\}^{\alpha + y}}.$$

The log likelihood is written as

$$\ell_r(\theta) = \sum_{i=1}^{r} \log f(y_i; x_i, \theta). \tag{1}$$
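
The log likelihood (1) can be evaluated directly with log-gamma functions. The following is a minimal sketch, assuming a logistic link with a single covariate and parameters (beta0, beta1, mu, alpha); the function names and the value alpha = 50 in the usage lines are hypothetical choices made for illustration only.

```python
import math

def expit(t):
    """Logistic link h; adequate for moderate |t|."""
    return 1.0 / (1.0 + math.exp(-t))

def log_f(y, x, beta0, beta1, mu, alpha):
    """Log of the marginal negative binomial density f(y; x, theta)."""
    m = mu * expit(beta0 + beta1 * x)  # marginal mean mu * h(x, beta)
    return (math.lgamma(alpha + y) - math.lgamma(alpha) - math.lgamma(y + 1)
            + alpha * math.log(alpha) + y * math.log(m)
            - (alpha + y) * math.log(alpha + m))

def log_lik(data, beta0, beta1, mu, alpha):
    """Equation (1): sum of log f over (y_i, x_i) pairs."""
    return sum(log_f(y, x, beta0, beta1, mu, alpha) for y, x in data)

# Three (count, dose) pairs taken from Table 1; alpha = 50 is an arbitrary illustrative value.
data = [(76, 6.25), (75, 6.50), (66, 6.75)]
print(log_lik(data, 6.705, -1.124, 196.2, 50.0))
```

A general-purpose optimizer can then be applied to the negative of `log_lik`, which is the route taken with the R function optim in Section 6.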

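This looks much like, but is not a special case of, negative binomial regression. In negative binomial regression,
$y \mid \lambda \sim \mathrm{Pois}(\lambda)$, $\lambda \sim \Gamma(\mu, \mu^2/\alpha)$, and $\mu = \phi(x, \beta)$ for some
function $\phi$, where $\Gamma(\mu, \mu^2/\alpha)$ denotes the gamma distribution with mean $\mu$ and variance
$\mu^2/\alpha$. In the proposed model, $y \mid \lambda \sim \mathrm{Pois}(\lambda h(x, \beta))$,
$\lambda \sim \Gamma(\mu, \mu^2/\alpha)$, and $\mu$ is only a parameter and has nothing to do with $\beta$.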

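Let $\hat\theta$ be the maximum likelihood estimator (MLE) for $\theta$. Asymptotically $\hat\theta$ is a multivariate
normal random vector with mean $\theta$ and variance-covariance matrix $I_r^{-1}(\theta)$ (e.g., Lehmann and Casella
1998, chapter 6), where, with $\nabla_\beta$ being the gradient with respect to $\beta$ and $h_i = h(x_i, \beta)$,
$I_r(\theta)$ is the Fisher information matrix given by

$$I_r(\theta) = \begin{pmatrix}
\sum_{i=1}^{r} \dfrac{\mu\, \nabla_\beta h_i \nabla_\beta' h_i}{h_i (1 + \alpha^{-1} \mu h_i)} &
\sum_{i=1}^{r} \dfrac{\nabla_\beta h_i}{1 + \alpha^{-1} \mu h_i} & 0 \\
\sum_{i=1}^{r} \dfrac{\nabla_\beta' h_i}{1 + \alpha^{-1} \mu h_i} &
\sum_{i=1}^{r} \dfrac{h_i}{\mu (1 + \alpha^{-1} \mu h_i)} & 0 \\
0 & 0 & i_{\alpha\alpha}
\end{pmatrix}, \tag{2}$$

where $i_{\alpha\alpha} = \sum_{i=1}^{r} \{\psi''(\alpha) - E\psi''(\alpha + y_i) - \alpha^{-1} + (\alpha + \mu h_i)^{-1}\}$
and $\psi(\alpha) = \log\Gamma(\alpha)$ as in the Appendix.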



Because of the zero entries in $I_r(\theta)$, $\alpha$ is orthogonal to the pair $(\beta, \mu)$. There are several
consequences (e.g., Cox and Reid 1987). The asymptotic standard errors of $\hat\beta$ and $\hat\mu$ are not affected by
treating $\alpha$ as either known or unknown. The MLEs of $(\beta, \mu)$ and $\alpha$ are asymptotically independent. The
MLEs of $\beta$ and $\mu$ given $\alpha$ vary only slowly with $\alpha$.



3 Efficiency loss 

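The parameter of interest is $\beta$. Within the proposed model, the MLE is asymptotically fully efficient. If we knew
the $n_i$, a more precise estimation of $\beta$ would be feasible. Two kinds of efficiency loss are of interest: that
originating from the $n_i$ being unknown, and that from the over-dispersion among the $n_i$.

If we knew the $n_i$, then the Fisher information matrix of $\beta$ would be the usual binomial one,

$$I_r(\beta \mid \{n_i\}_{i=1}^{r}) = \sum_{i=1}^{r} \frac{n_i\, \nabla_\beta h(x_i, \beta) \nabla_\beta' h(x_i, \beta)}{h(x_i, \beta) \{1 - h(x_i, \beta)\}}.$$

When the $n_i$ are unknown and arise from a Poisson distribution with mean $\mu$, the Fisher information matrix is

$$I_r(\beta, \mu) = \begin{pmatrix}
\mu \sum_{i=1}^{r} \dfrac{\nabla_\beta h(x_i, \beta) \nabla_\beta' h(x_i, \beta)}{h(x_i, \beta)} &
\sum_{i=1}^{r} \nabla_\beta h(x_i, \beta) \\
\sum_{i=1}^{r} \nabla_\beta' h(x_i, \beta) &
\mu^{-1} \sum_{i=1}^{r} h(x_i, \beta)
\end{pmatrix}, \tag{3}$$

which is obtained from $I_r(\theta)$ in (2) by letting $\alpha = \infty$.

A sensible approach to evaluate the efficiency loss is to average $I_r(\beta \mid \{n_i\}_{i=1}^{r})$ by treating the
$n_i$ as a sample from a Poisson distribution with mean $\mu$, i.e., to consider

$$I_r(\beta \mid \mu) = \mu \sum_{i=1}^{r} \frac{\nabla_\beta h(x_i, \beta) \nabla_\beta' h(x_i, \beta)}{h(x_i, \beta) \{1 - h(x_i, \beta)\}}. \tag{4}$$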

A numerical experiment is used to investigate the efficiency losses, in which there is a single covariate $x$ and the
parameter of interest is the slope $\beta_1$ while the intercept $\beta_0$ is fixed to be one. A $2^4$ design is
considered, i.e.,

$$(X, \beta_1, \mu, \alpha) \in \{X_1, X_2\} \times \{1, 2\} \times \{100, 300\} \times \{25, 49\}, \tag{5}$$

where $X$ is the set of covariate values, $X_1$ is the set of integers in $[-5, 5]$, and $X_2$ is $\{-0.63, 1.59, -3.01,
-6.85, -4.97, 1.86, -7.54, -3.45, -4.45, -1.87, 6.49\}$, a set of 11 normal random variables with mean 0 and variance
25. It is assumed that the number of replications is identical over each $x$ value.
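
The 16 settings of the design can be enumerated mechanically. The sketch below assumes the factor order (covariate set, slope, mean, shape), which is inferred from the figure captions; the variable names are illustrative.

```python
from itertools import product

X1 = list(range(-5, 6))  # integers in [-5, 5]
X2 = [-0.63, 1.59, -3.01, -6.85, -4.97, 1.86, -7.54, -3.45, -4.45, -1.87, 6.49]

# Each setting is a tuple (covariate set, beta1, mu, alpha).
settings = list(product([X1, X2], [1, 2], [100, 300], [25, 49]))

for xs, beta1, mu, alpha in settings:
    label = "X1" if xs is X1 else "X2"
    print(label, beta1, mu, alpha)
```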

Let $\rho$ be the ratio of the asymptotic standard deviation of $\hat\beta_1$ calculated from $I_r(\beta \mid \mu)$ in (4)
to that from $I_r(\beta, \mu)$ in (3). Let $\gamma$ be the ratio of the asymptotic standard deviation of $\hat\beta_1$
calculated from $I_r(\beta, \mu)$ in (3) to that from $I_r(\theta)$ in (2). This means that $\rho\gamma$ is the ratio of
the asymptotic standard deviation of $\hat\beta_1$ calculated from $I_r(\beta \mid \mu)$ in (4) to that from
$I_r(\theta)$ in (2). Table 2 presents these efficiency loss measures. The ranges of $\rho$, $\gamma$ and $\rho\gamma$ are
(0.706, 0.786), (0.732, 0.941) and (0.517, 0.740), respectively.

Figure 1 shows how the efficiency loss measure $\gamma$ changes when $\alpha$ varies continuously. As $\alpha$ increases,
the gamma distribution tends to become degenerate, and the efficiency loss from the over-dispersion among the $n_i$
decreases. In the four panels, the efficiency loss is at most 0.621, which indicates that the efficiency loss from the
over-dispersion among the $n_i$ is small.
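
The behaviour of $\gamma$ as $\alpha$ grows can be explored numerically. The sketch below is a simplified version of the calculation: it treats the intercept as known, so that only the slope and the mean are estimated and the relevant information block is 2 x 2. That simplification, the logistic link and the function names are assumptions made for illustration, so the numbers are not expected to reproduce Table 2 or Figure 1.

```python
import math

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def var_beta1(xs, beta1, mu, alpha, beta0=1.0):
    """Asymptotic variance of the slope from the (beta1, mu) information block.
    alpha = math.inf gives the plain Poisson case."""
    i_bb = i_bm = i_mm = 0.0
    for x in xs:
        h = expit(beta0 + beta1 * x)
        dh = x * h * (1.0 - h)            # derivative of h with respect to beta1
        w = 1.0 / (1.0 + mu * h / alpha)  # over-dispersion weight, equal to 1 when alpha = inf
        i_bb += w * mu * dh * dh / h
        i_bm += w * dh
        i_mm += w * h / mu
    return i_mm / (i_bb * i_mm - i_bm ** 2)  # (1,1) entry of the 2x2 inverse

def gamma_ratio(xs, beta1, mu, alpha):
    """Standard deviation of the slope without over-dispersion divided by that with it."""
    return math.sqrt(var_beta1(xs, beta1, mu, math.inf) / var_beta1(xs, beta1, mu, alpha))

X1 = range(-5, 6)
for alpha in (25.0, 49.0, 1e3, 1e6):
    print(alpha, gamma_ratio(X1, 1.0, 100.0, alpha))
```

Because $\alpha$ is orthogonal to the other parameters, the $\alpha$ entry of the information matrix is not needed for this ratio.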



Table 2: The efficiency loss measures over 16 settings. 
4 Estimation precision of $\beta$ and $\mu$

For a Poisson sample of size $r$ with mean $\mu$, the information of $\mu$ is given by $I_r(\mu) = r/\mu$. When $\mu$
increases, the estimation of $\mu$ becomes less precise. On the contrary, a large $\mu$ serves a good purpose for the
estimation of $\beta$ in a logistic regression model, which can be clearly seen from $I_r(\beta \mid \mu)$ in (4).

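When the $n_i$ are unknown and arise from a Poisson distribution with mean $\mu$, although $\beta$ and $\mu$ are not
orthogonal, we will show that a large $\mu$ leads to a more precise estimation of $\beta$ but a less precise estimation
of $\mu$. To this end, we partition the asymptotic variance-covariance matrix $V$, i.e., the inverse matrix of
$I_r(\beta, \mu)$ in (3), into a $2 \times 2$ block form, where $V = (V_{ij})$, and, with $h_i = h(x_i, \beta)$,

$$V_{11} = \mu^{-1} \Big[ \sum_{i=1}^{r} \frac{\nabla_\beta h_i \nabla_\beta' h_i}{h_i} - \Big\{ \sum_{i=1}^{r} h_i \Big\}^{-1} \sum_{i=1}^{r} \nabla_\beta h_i \sum_{i=1}^{r} \nabla_\beta' h_i \Big]^{-1},$$

$$V_{22} = \mu \Big[ \sum_{i=1}^{r} h_i - \sum_{i=1}^{r} \nabla_\beta' h_i \Big\{ \sum_{i=1}^{r} \frac{\nabla_\beta h_i \nabla_\beta' h_i}{h_i} \Big\}^{-1} \sum_{i=1}^{r} \nabla_\beta h_i \Big]^{-1}.$$

The diagonal entries of $\mu V_{11}$ and $V_{22}/\mu$ are nonnegative and do not depend on $\mu$. Hence the variance of
each component in $\hat\beta$ is a nonincreasing function of $\mu$, while that of $\hat\mu$ is a nondecreasing function of
$\mu$.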
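
The scaling in $\mu$ can be checked numerically in the simplest case. The sketch below assumes a logistic link with a known intercept and a single slope, so that the Poisson information matrix for the slope and the mean is 2 x 2 and can be inverted in closed form; the function name and parameter values are illustrative.

```python
import math

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def poisson_variances(xs, beta1, mu, beta0=1.0):
    """Asymptotic variances of (beta1, mu) from the 2x2 Poisson information matrix."""
    a = b = c = 0.0
    for x in xs:
        h = expit(beta0 + beta1 * x)
        dh = x * h * (1.0 - h)  # derivative of h with respect to beta1
        a += dh * dh / h
        b += dh
        c += h
    det = a * c - b * b  # determinant of [[mu * a, b], [b, c / mu]]
    return c / (mu * det), mu * a / det

X1 = range(-5, 6)
for mu in (100.0, 300.0):
    v_beta, v_mu = poisson_variances(X1, 1.0, mu)
    # mu * var(beta1) and var(mu) / mu should be the same for every mu.
    print(mu, mu * v_beta, v_mu / mu)
```

Tripling $\mu$ divides the variance of the slope by three and multiplies the variance of the mean by three, which is the trade-off described above.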

When the $n_i$ are over-dispersed, we also conjecture that a large $\mu$ will have the same effects on the estimation of
$\mu$ and $\beta$ as those in the simple Poisson model. Figure 2 is a numerical illustration.

Figure 1: The efficiency loss measure $\gamma$ varies with $\alpha$: $\beta_0 = 1$, $x \in X_1$, $(\beta_1, \mu) = (1, 100)$
(upper left panel), (2, 100) (upper right), (1, 300) (lower left), and (2, 300) (lower right).

Figure 2: The standard deviations of $\hat\beta$ and $\hat\mu$ change with respect to $\mu$: $\beta_0 = 1$, $x \in X_1$,
10 replications at each $x$, $(\beta_1, \alpha) = (1, 25)$ (top panels) and (2, 49) (bottom panels).
5 Simulation

A simulation with the same 16 settings as the efficiency loss study and 1000 samples per setting is reported in
Table 3. The number of replications is 10 over each $x$ value. The bias and mean square error are small. All coverage
probabilities of the 95% confidence intervals are close to their nominal value 0.95.
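
One sample from the proposed model can be generated as in the following sketch, which assumes a logistic link with intercept one and uses only the Python standard library. The Poisson sampler (Knuth's multiplication method), the seed and the function names are illustrative choices; the original simulation code is not available.

```python
import math
import random

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def rpois(lam, rng):
    """Knuth's multiplication method; adequate for the moderate means used here."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate(xs, reps, beta0, beta1, mu, alpha, rng):
    """Draw (y, x) pairs with lambda ~ Gamma(shape alpha, mean mu) and
    y | lambda ~ Pois(lambda * h(x, beta))."""
    data = []
    for x in xs:
        h = expit(beta0 + beta1 * x)
        for _ in range(reps):
            lam = rng.gammavariate(alpha, mu / alpha)  # second argument is the scale
            data.append((rpois(lam * h, rng), x))
    return data

rng = random.Random(2024)
sample = simulate(range(-5, 6), 10, 1.0, 1.0, 100.0, 25.0, rng)
print(len(sample), sample[:5])
```

Repeating this 1000 times per setting and refitting the model each time gives bias, mean square error and coverage estimates of the kind summarized in Table 3.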



Table 3: Simulation results (the nominal value of coverage probability is 0.95). 
49   0.013   0.949   0.001   0.947   0.006   0.945



6 Example 

For the jejunal crypt data in Table 1, it is assumed that the surviving probabilities $p_i$ satisfy
$\log\{p_i/(1 - p_i)\} = \beta_0 + \beta_1 x_i$ for all $i$ (e.g., Kim and Taylor 1994 and Elder et al. 1999). The R
function optim is used to maximize the likelihood function in (1). The estimates of $\beta$ and $\mu$ are stable, but
that of $\alpha$ varies a lot. The estimate $\hat\alpha$ is 3121.834 when its initial value is 20, but becomes 6070.602
when its initial value is 200. The Hessian matrix is found to be nearly singular, which implies that the variance of
$\hat\alpha$ is huge.
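
One way to see why the estimate of $\alpha$ is unstable: as $\alpha$ grows the negative binomial density approaches the Poisson density, so for $\alpha$ in the thousands the log likelihood is nearly flat in $\alpha$. The sketch below evaluates the log density for a single observation at several values of $\alpha$, including the two reported estimates; the count y and the mean m are illustrative assumptions, not fitted values.

```python
import math

def nb_logpmf(y, m, alpha):
    """Negative binomial log density with mean m and shape alpha."""
    return (math.lgamma(alpha + y) - math.lgamma(alpha) - math.lgamma(y + 1)
            + alpha * math.log(alpha) + y * math.log(m)
            - (alpha + y) * math.log(alpha + m))

def pois_logpmf(y, m):
    """Poisson log density with mean m."""
    return y * math.log(m) - m - math.lgamma(y + 1)

y, m = 40, 38.0  # illustrative count and mean
for alpha in (20.0, 200.0, 3121.834, 6070.602):
    # Difference from the Poisson limit shrinks as alpha grows.
    print(alpha, nb_logpmf(y, m, alpha) - pois_logpmf(y, m))
```

With so little curvature in $\alpha$, an optimizer can stop at very different values depending on where it starts, which is consistent with the nearly singular Hessian.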

Based on the likelihood ratio test, we assume that the $n_i$ arise from a single Poisson distribution. The results
are reported in Table 4, which also shows the estimates and their standard errors using logistic regression (with
$n_i = 160$) and the quasi-likelihood approaches in Kim and Taylor (1994) (with $E(n_i) = 160$) and Elder et al. (1999).
Our estimates and standard errors are very close to those in Elder et al. (1999). All the estimates of the previous
methods fall into our 95% confidence intervals: (5.207, 8.203) for $\beta_0$, (-1.248, -1.000) for $\beta_1$ and
(103.4, 289.0) for $\mu$. The standard errors of $\hat\beta$ are small, while that of $\hat\mu$ is quite large. Elder et
al. (1999) conjectured that the variance inflation of $\hat\mu$ is due to the data structure, i.e., there is no zero
dose. By the estimation precision study in Section 4, we can not only give a more accurate explanation of the large
standard error of $\hat\mu$, but also explain the small standard errors of $\hat\beta$. Since $\hat\mu$ is as large as
196.2, the standard error of $\hat\mu$ is large, while those of $\hat\beta$ are small.
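
The quoted confidence intervals can be checked against the tabulated estimates. The sketch below assumes that they are normal-approximation (Wald) intervals of the form estimate plus or minus 1.96 standard errors, which the text does not state explicitly; the inputs are the estimates and standard errors of the proposed method from Table 4.

```python
def wald_interval(estimate, std_error, z=1.96):
    """Two-sided normal-approximation interval; z = 1.96 for 95% coverage."""
    return estimate - z * std_error, estimate + z * std_error

# Estimates and standard errors of the proposed method as tabulated in Table 4.
proposed = {"beta0": (6.705, 0.764), "beta1": (-1.124, 0.063), "mu": (196.2, 47.4)}

intervals = {name: wald_interval(est, se) for name, (est, se) in proposed.items()}
for name, (lower, upper) in intervals.items():
    print(f"{name}: ({lower:.3f}, {upper:.3f})")
```

Under this assumption the intervals agree with the quoted ones up to rounding of the tabulated values.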

7 Discussion 

One may consider estimating the $n_i$ and then applying logistic regression. There is a large literature on estimating
the binomial size $n$ under the condition that $p$ is either known or unknown. If $p$ is unknown, then it is usually
treated as a nuisance parameter (e.g., Draper and Guttman, 1971; Carroll and Lombard, 1985). Unlike the studies in the
literature, here the $p_i$ depend on covariates, and many $n_i$ need to be estimated. Such a two-stage approach also
makes the analysis unnecessarily complicated. The proposed approach estimates all parameters in a seamless fashion by
treating the means of the $n_i$ as nuisance parameters and integrating them out.

Table 4: The jejunal crypt data results from the proposed and previous approaches (logistic regression and Kim's
method fix $n_i$ and $E(n_i)$ at 160, respectively; Kim's and Elder's quasi-likelihood method of moments estimates come
from Elder et al. (1999)).

                          estimate (standard error)
            logistic         Kim's            Elder's          proposed
$\beta_0$   7.432 (0.175)    7.410 (0.191)    6.727 (0.725)    6.705 (0.764)
$\beta_1$   -1.185 (0.024)   -1.183 (0.026)   -1.126 (0.061)   -1.124 (0.063)
$\mu$       -                -                194.7 (43.4)     196.2 (47.4)



Appendix: The Fisher information matrix 

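Let $\ell(\theta) = \log f(y; x, \theta)$ and $\psi(\alpha) = \log\Gamma(\alpha)$, and write $h = h(x, \beta)$. The first
order derivatives are

$$\frac{\partial \ell}{\partial \alpha} = \psi'(\alpha + y) - \psi'(\alpha) + \log\alpha + 1 - \log(\alpha + \mu h) - \frac{\alpha + y}{\alpha + \mu h},$$

$$\frac{\partial \ell}{\partial \mu} = \frac{y}{\mu} - \frac{(\alpha + y) h}{\alpha + \mu h},$$

$$\frac{\partial \ell}{\partial \beta} = \Big\{ \frac{y}{h} - \frac{(\alpha + y)\mu}{\alpha + \mu h} \Big\} \nabla_\beta h.$$

The second order derivatives are

$$\frac{\partial^2 \ell}{\partial \alpha^2} = \psi''(\alpha + y) - \psi''(\alpha) + \frac{1}{\alpha} - \frac{1}{\alpha + \mu h} + \frac{y - \mu h}{(\alpha + \mu h)^2},$$

$$\frac{\partial^2 \ell}{\partial \mu^2} = -\frac{y}{\mu^2} + \frac{(\alpha + y) h^2}{(\alpha + \mu h)^2},$$

$$\frac{\partial^2 \ell}{\partial \beta \partial \beta'} = -\Big\{ \frac{y}{h^2} - \frac{(\alpha + y)\mu^2}{(\alpha + \mu h)^2} \Big\} \nabla_\beta h \nabla_\beta' h + \Big\{ \frac{y}{h} - \frac{(\alpha + y)\mu}{\alpha + \mu h} \Big\} \frac{\partial^2 h}{\partial \beta \partial \beta'},$$

$$\frac{\partial^2 \ell}{\partial \alpha \partial \mu} = \frac{h (y - \mu h)}{(\alpha + \mu h)^2},$$

$$\frac{\partial^2 \ell}{\partial \alpha \partial \beta} = \frac{\mu (y - \mu h)}{(\alpha + \mu h)^2} \nabla_\beta h,$$

$$\frac{\partial^2 \ell}{\partial \mu \partial \beta} = \frac{-(\alpha + y)\alpha}{(\alpha + \mu h)^2} \nabla_\beta h.$$

By taking the negative expectation with respect to $f(y; x, \theta)$, one obtains the Fisher information matrix. Note
that $E(y) = \mu h(x, \beta)$.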
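
As a worked instance of this last step, here is a sketch for the entries involving $\mu$, using only $E(y) = \mu h(x, \beta)$ and the second order derivatives of the log density, with $h = h(x, \beta)$:

$$-E\Big(\frac{\partial^2 \ell}{\partial \mu^2}\Big) = \frac{h}{\mu} - \frac{(\alpha + \mu h) h^2}{(\alpha + \mu h)^2} = \frac{\alpha h}{\mu (\alpha + \mu h)} = \frac{h}{\mu (1 + \alpha^{-1} \mu h)},$$

$$E\Big(\frac{\partial^2 \ell}{\partial \alpha \partial \mu}\Big) = \frac{h \{E(y) - \mu h\}}{(\alpha + \mu h)^2} = 0.$$

The second line, together with the same argument applied to the mixed derivative in $\alpha$ and $\beta$, gives the zero entries that make $\alpha$ orthogonal to $(\beta, \mu)$.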